34 research outputs found

    Evaluation of deep neural networks for reduction of credit card fraud alerts

    Fraud detection systems support advanced detection techniques based on complex rules, statistical modelling and machine learning. However, alerts triggered by these systems still require expert judgement to either confirm a fraud case or discard a false positive. Reducing the number of false positives that fraud analysts investigate, by automating their detection with computer-assisted techniques, can lead to significant cost efficiencies. Alert reduction has been achieved with different techniques in related fields such as intrusion detection. Furthermore, deep learning has been used to accomplish this task in other fields. In our paper, a set of deep neural networks has been tested to measure their ability to detect false positives by processing alerts triggered by a fraud detection system. The performance achieved by each neural network setting is presented and discussed. The optimal setting captured 91.79% of total fraud cases with 35.16% fewer alerts. The obtained alert reduction rate would entail a significant reduction in the cost of human labor, because alerts classified as false positives by the neural network would not require human inspection.
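
    The following sketch illustrates the general idea of alert reduction with a feed-forward classifier; it is not the paper's architecture, and the synthetic features, thresholds and layer sizes are illustrative assumptions only.

```python
# Hypothetical sketch: a small feed-forward network scores alerts, and only
# alerts with a non-negligible fraud probability are kept for human review.
# Data, feature names and the 0.05 threshold are illustrative, not from the paper.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))           # stand-in for alert features
y = (rng.random(5000) < 0.1).astype(int)  # 1 = confirmed fraud, 0 = false positive

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
scaler = StandardScaler().fit(X_train)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
clf.fit(scaler.transform(X_train), y_train)

# Alerts with a very low predicted fraud probability could be auto-discarded;
# the remaining alerts would still be inspected by analysts.
proba = clf.predict_proba(scaler.transform(X_test))[:, 1]
auto_discard = proba < 0.05
print(f"Alerts removed from the review queue: {auto_discard.mean():.1%}")
print(f"Fraud cases still reviewed: {(y_test[~auto_discard] == 1).sum()} of {(y_test == 1).sum()}")
```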

    Integrating descriptions of knowledge management learning activities into large ontological structures: A case study

    Ontologies have been recognized as a fundamental infrastructure for advanced approaches to Knowledge Management (KM) automation, and the conceptual foundations for them have been discussed in some previous reports. Nonetheless, such conceptual structures should be properly integrated into existing ontological bases, for the practical purpose of providing the required support for the development of intelligent applications. Such applications should ideally integrate KM concepts into a framework of commonsense knowledge with clear computational semantics. In this paper, such an integration work is illustrated through a concrete case study, using the large OpenCyc knowledge base. Concretely, the main elements of the Holsapple & Joshi KM ontology and some existing work on e-learning ontologies are explicitly linked to OpenCyc definitions, providing a framework for the development of functionalities that use the built-in reasoning services of OpenCyc in KM activities. The integration can be used as the point of departure for the engineering of KM-oriented systems that account for a shared understanding of the discipline and rely on public semantics provided by one of the largest open knowledge bases available.
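
    As a rough illustration of what "explicitly linking" ontology elements can look like in practice, the sketch below asserts mappings between placeholder KM-ontology terms and placeholder OpenCyc concept URIs using rdflib; the namespaces and concept identifiers are assumptions, not the paper's actual mapping.

```python
# Minimal sketch (not the paper's mapping): link KM-ontology terms to OpenCyc
# definitions so that OpenCyc reasoning can be reused over KM descriptions.
# All URIs and concept names below are placeholders.
from rdflib import Graph, Namespace, RDFS, OWL

KM = Namespace("http://example.org/km-ontology#")    # hypothetical KM namespace
CYC = Namespace("http://sw.opencyc.org/concept/")    # OpenCyc base; concept IDs are placeholders

g = Graph()
g.bind("km", KM)
g.bind("cyc", CYC)

# A KM learning activity is declared a specialization of an OpenCyc concept,
# and a KM knowledge resource is declared equivalent to another one.
g.add((KM.LearningActivity, RDFS.subClassOf, CYC["Placeholder_Activity"]))
g.add((KM.KnowledgeResource, OWL.equivalentClass, CYC["Placeholder_InformationBearingThing"]))

print(g.serialize(format="turtle"))
```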

    Comparing social media and Google to detect and predict severe epidemics

    Internet technologies have demonstrated their value for the early detection and prediction of epidemics. In diverse cases, electronic surveillance systems can be created by obtaining and analyzing on-line data, complementing other existing monitoring resources. This paper reports on the feasibility of building such a system with search engine and social network data. Concretely, this study aims at gathering evidence on which kind of data source leads to better results. Data have been acquired from the Internet by means of a system which gathered real-time data for 23 weeks. Data on influenza in Greece have been collected from Google and Twitter, and they have been compared to influenza data from the official authority of Europe. The data were analyzed using two models: an ARIMA model that computed estimations based on weekly sums, and a customized approximate model that uses daily sums. Results indicate that influenza was successfully monitored during the test period. Google data show a high Pearson correlation and a relatively low Mean Absolute Percentage Error (R=0.933, MAPE=21.358). Twitter results are slightly better (R=0.943, MAPE=18.742). The alternative model is slightly worse than the ARIMA(X) (R=0.863, MAPE=22.614), but with a higher mean deviation (abs. mean dev.: 5.99% vs 4.74%).
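
    A minimal sketch of this kind of evaluation is shown below: an ARIMA model with an exogenous on-line signal is fitted and scored with Pearson's R and MAPE. The synthetic series stands in for the study's Google/Twitter and official influenza counts; the model order and split are arbitrary assumptions.

```python
# Illustrative sketch (synthetic data, not the study's dataset): fit ARIMA with
# an exogenous regressor (e.g. weekly search/social counts) and compare the
# forecast against "official" counts using Pearson's R and MAPE.
import numpy as np
from scipy.stats import pearsonr
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
weeks = 23
official = 100 + 10 * np.sin(np.linspace(0, 3, weeks)) + rng.normal(0, 3, weeks)
online_signal = official * 0.9 + rng.normal(0, 5, weeks)   # proxy from on-line data

train, test = slice(0, 18), slice(18, weeks)
model = ARIMA(official[train], exog=online_signal[train], order=(1, 0, 0)).fit()
forecast = model.forecast(steps=weeks - 18, exog=online_signal[test].reshape(-1, 1))

r, _ = pearsonr(official[test], forecast)
mape = np.mean(np.abs((official[test] - forecast) / official[test])) * 100
print(f"R = {r:.3f}, MAPE = {mape:.2f}%")
```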

    On the graph structure of the Web of Data

    This article describes how the Web of Data has emerged as the realization of a machine-readable web relying on the Resource Description Framework (RDF) language as a way to provide richer semantics to datasets. While the Web of Data is based on similar principles to the original Web, with interlinking being the principal mechanism to relate information, the differences in the structure of the information are evident. Several studies have analysed the graph structure of the Web, yielding important insights that were used in relevant applications. However, those findings cannot be transposed to the Web of Data, due to fundamental differences in production, link creation and usage. This article reports on a study of the graph structure of the Web of Data using methods and techniques from similar studies for the Web. Results show that the Web of Data also complies with the bow-tie theory. Other characteristics are the low distance between nodes and the low closeness and degree centrality. Regarding the datasets, the biggest one is Open Data Euskadi, but the one with the most connections to other datasets is DBpedia.
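
    The sketch below shows the type of measurement involved in a bow-tie analysis (largest strongly connected component as CORE, plus IN/OUT parts) and the centrality figures mentioned above; it runs on a random directed graph, not on actual Web of Data crawl data.

```python
# Toy sketch of a bow-tie decomposition and centrality summary on a directed
# link graph. The random graph is a stand-in for a Web of Data crawl.
import networkx as nx

g = nx.gnp_random_graph(200, 0.02, directed=True, seed=42)

core = max(nx.strongly_connected_components(g), key=len)   # bow-tie CORE = largest SCC
rep = next(iter(core))                                     # any CORE node reaches the same sets
out_part = nx.descendants(g, rep) - core                   # reachable from CORE: OUT
in_part = nx.ancestors(g, rep) - core                      # can reach CORE: IN
print(f"CORE={len(core)}, IN={len(in_part)}, OUT={len(out_part)}")

degree = nx.degree_centrality(g)
closeness = nx.closeness_centrality(g)
print("mean degree centrality:", sum(degree.values()) / len(degree))
print("mean closeness centrality:", sum(closeness.values()) / len(closeness))
```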

    Detecting browser drive-by exploits in images using deep learning

    Steganography is the set of techniques aiming to hide information in messages such as images. Recently, steganographic techniques have been combined with polyglot attacks to deliver exploits in Web browsers. Machine learning approaches have been proposed in previous works as a solution for detecting steganography in images, but the specifics of hiding exploit code have not been systematically addressed to date. This paper proposes the use of deep learning methods for such detection, accounting for the specifics of the situation in which the images and the malicious content are delivered using spatial- and frequency-domain steganography algorithms. The methods were evaluated using benchmark image databases with collections of JavaScript exploits, for different density levels and steganographic techniques in images. A convolutional neural network was built to classify the infected images, with a validation accuracy of around 98.61% and a validation AUC score of 99.75%.
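
    A minimal sketch of such a classifier is given below, with accuracy and AUC tracked during training; the input size, layer sizes and random data are placeholder assumptions, not the network reported in the paper.

```python
# Hedged sketch of a small CNN for flagging images that may carry hidden
# payloads (1 = stego/exploit, 0 = clean). Shapes and data are illustrative.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(256, 64, 64, 3).astype("float32")   # stand-in image patches
y = np.random.randint(0, 2, size=256)

model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", keras.metrics.AUC(name="auc")])
model.fit(X, y, validation_split=0.2, epochs=2, verbose=0)
print(model.evaluate(X, y, verbose=0))   # [loss, accuracy, auc]
```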

    Traceability for trustworthy AI: a review of models and tools

    Traceability is considered a key requirement for trustworthy artificial intelligence (AI), related to the need to maintain a complete account of the provenance of data, processes, and artifacts involved in the production of an AI model. Traceability in AI shares part of its scope with general-purpose recommendations for provenance such as W3C PROV, and it is also supported to different extents by specific tools used by practitioners as part of their efforts in making data analytic processes reproducible or repeatable. Here, we review relevant tools, practices, and data models for traceability in their connection to building AI models and systems. We also propose some minimal requirements to consider a model traceable according to the assessment list of the High-Level Expert Group on AI. Our review shows how, although a good number of reproducibility tools are available, a common approach is currently lacking, together with the need for shared semantics. Besides, we have detected that some tools have either not achieved full maturity or are already falling into obsolescence or a state of near abandonment by their developers, which might compromise the reproducibility of the research entrusted to them.
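
    To make the W3C PROV connection concrete, here is a minimal sketch (using the "prov" Python package, one possible tool among many) of the kind of provenance record a traceable training pipeline could keep; all entity, activity and agent names are illustrative.

```python
# Minimal W3C PROV sketch: record which dataset and which training run produced
# which model, and who was responsible. Identifiers are placeholders.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

data = doc.entity("ex:training-data")
model = doc.entity("ex:model-v1")
run = doc.activity("ex:training-run-01")
scientist = doc.agent("ex:data-scientist")

doc.used(run, data)                     # the run consumed the dataset
doc.wasGeneratedBy(model, run)          # the model is an output of the run
doc.wasAssociatedWith(run, scientist)   # responsibility for the run
doc.wasDerivedFrom(model, data)         # the model's provenance traces back to the data

print(doc.serialize(indent=2))          # PROV-JSON serialization
```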

    Evolution and prospects of the Comprehensive R Archive Network (CRAN) package ecosystem

    Free and open source software package ecosystems have existed for a long time, but such collaborative development practice has surged in recent years. One of the oldest and most popular package ecosystems is the Comprehensive R Archive Network (CRAN), the repository of packages of the statistical language R, a popular statistical computing environment. CRAN stores a large number of packages that are updated regularly and depend on many other packages in a complex graph of relations. As the repository grows, its sustainability could be threatened by that complexity or by the nonuniform evolution of some packages. This paper provides an empirical analysis of the evolution of the CRAN repository over the last 20 years, considering the laws of software evolution and the effect of CRAN's policies on such development. Results show how the progress of CRAN is consistent with the laws of continuous growth and change, and how there seems to be a relevant increase in complexity in recent years. Significant challenges are arising related to the scale and scope of software package managers and the services they provide; understanding how they change over time and what might endanger their sustainability are key factors for their future improvement, maintenance, policies, and, eventually, the sustainability of the ecosystem.
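
    The sketch below shows one simple way to obtain a snapshot of that dependency graph by parsing CRAN's PACKAGES index into Imports/Depends edges; the field handling is deliberately simplified and the script assumes network access to the CRAN URL.

```python
# Rough sketch: parse CRAN's PACKAGES index (Debian-control-style records) and
# build a directed graph of Depends/Imports relations between packages.
import re
import urllib.request
import networkx as nx

URL = "https://cran.r-project.org/src/contrib/PACKAGES"
text = urllib.request.urlopen(URL).read().decode("utf-8", errors="replace")

graph = nx.DiGraph()
for record in text.split("\n\n"):
    # key: value pairs, where continuation lines start with whitespace
    fields = dict(re.findall(r"^([A-Za-z]+): ?(.*(?:\n .*)*)", record, re.M))
    name = fields.get("Package")
    if not name:
        continue
    deps = fields.get("Depends", "") + "," + fields.get("Imports", "")
    for dep in re.split(r"[,\n]", deps):
        dep = re.sub(r"\(.*?\)", "", dep).strip()   # drop version constraints
        if dep and dep != "R":
            graph.add_edge(name, dep)

print(f"{graph.number_of_nodes()} packages, {graph.number_of_edges()} dependency edges")
```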

    Predicting length of stay across hospital departments

    The length of hospital stay and its implications have a significant economic and human impact. As a consequence, the prediction of that key parameter has been the subject of research in recent years. Most previous work has analysed length of stay in particular hospital departments within specific study groups, which has resulted in successful prediction rates but only occasionally reported predictive patterns. In this work we report a predictive model for length of stay (LOS), together with a study of trends and patterns that supports a better understanding of how LOS varies across different hospital departments and specialties. We also analyse in which hospital departments the prediction of LOS from patient data is more insightful. After estimating prediction rates, several patterns were found; those patterns allowed us, for instance, to determine how to increase prediction accuracy in women admitted to the emergency room for enteritis problems. Overall, concerning these recognised patterns, the results are up to 21.61% better than the results with baseline machine learning algorithms in terms of error rate, and up to 23.83% in terms of success rate in the number of predicted cases, which is useful to guide the decision on where to focus attention when predicting LOS.
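
    The following sketch illustrates the comparison against a baseline on synthetic records with hypothetical feature names (age, department, admission type); it is only a stand-in for the per-department analysis described above, not the paper's model or data.

```python
# Illustrative sketch: predict LOS from admission features and compare the
# error against a mean-predicting baseline. Records and features are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 95, n),
    "department": rng.integers(0, 5, n),        # encoded hospital department
    "admission_type": rng.integers(0, 3, n),    # e.g. emergency vs. scheduled
})
df["los_days"] = 2 + 0.05 * df["age"] + df["department"] + rng.exponential(2, n)

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="los_days"), df["los_days"], random_state=0)

baseline_mae = mean_absolute_error(y_test, np.full(len(y_test), y_train.mean()))
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"baseline MAE={baseline_mae:.2f} days, model MAE={model_mae:.2f} days")
print(f"improvement over baseline: {(baseline_mae - model_mae) / baseline_mae:.1%}")
```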

    Authority-based conversation tracking in Twitter: an unattended methodological approach

    Twitter is undoubtedly one of the most widely used data sources to analyze human communication. The literature is full of examples where Twitter is accessed and data are downloaded as the previous step to a more in-depth analysis in a wide variety of knowledge areas. Unfortunately, the extraction of relevant information from the opinions that users freely express on Twitter is complicated, both because of the volume generated (more than 6,000 tweets per second) and the difficulty of isolating only what is pertinent to our research. Inspired by the fact that a large part of users use Twitter to communicate or receive political information, we created a method that allows for the monitoring of a set of users (which we will call authorities) and the tracking of the information published by them about an event. Our approach consists of dynamically and automatically monitoring the hottest topics among all the conversations in which the authorities are involved, and retrieving the tweets connected with those topics while filtering other conversations out. Although our case study applies the method to the political discussions held during the Spanish general, local, and European elections of April/May 2019, the method is equally applicable to many other contexts, such as sporting events, marketing campaigns, or health crises.
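
    The schematic sketch below captures the two-step idea (find hot topics among the authorities, then keep any tweet about those topics); the tweet objects are plain dicts and the handles are hypothetical, so it does not depend on live Twitter API calls.

```python
# Schematic sketch of authority-based tracking: hot topics are the most
# frequent hashtags among a fixed set of authorities; the tracked conversation
# is any tweet mentioning those topics. Handles and tweets are illustrative.
from collections import Counter

AUTHORITIES = {"party_a", "party_b", "candidate_c"}   # hypothetical handles

tweets = [
    {"user": "party_a", "text": "Our plan for #elections2019 and #housing"},
    {"user": "random_user", "text": "Nice weather today"},
    {"user": "candidate_c", "text": "Debate tonight! #elections2019"},
    {"user": "random_user", "text": "Watching the debate #elections2019"},
]

def hashtags(text):
    return {w.lower() for w in text.split() if w.startswith("#")}

# 1) Hot topics = most frequent hashtags in the authorities' own tweets.
counts = Counter(tag for t in tweets if t["user"] in AUTHORITIES
                 for tag in hashtags(t["text"]))
hot_topics = {tag for tag, _ in counts.most_common(2)}

# 2) Track the conversation: keep any tweet (from anyone) about a hot topic.
tracked = [t for t in tweets if hashtags(t["text"]) & hot_topics]
print(hot_topics, len(tracked), "tweets tracked")
```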

    Modeling Bacterial Species: Using Sequence Similarity with Clustering Techniques

    Existing studies have challenged the current definition of named bacterial species, especially in the case of highly recombinogenic bacteria. This has led to considering the use of computational procedures to examine potential bacterial clusters that are not identified by species naming. This paper describes the use of sequence data obtained from MLST databases as input for a k-means algorithm extended to use housekeeping gene sequences as a metric of similarity for the clustering process. An implementation of the k-means algorithm has been developed based on an existing source code implementation, and it has been evaluated against MLST data. Results point to potential bacterial clusters that are close to more than one named species and may thus become candidates for alternative classifications accounting for genotypic information. The use of hierarchical clustering with sequence comparison as a similarity metric has the potential to find clusters different from named species by using a more informed cluster formation strategy than a conventional nominal variant of the algorithm.
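
    Since plain k-means requires vector means, a common way to use a sequence metric instead is a k-medoids-style variant; the sketch below is such a simplified stand-in, using normalized Hamming distance over toy concatenated gene sequences (not actual MLST data, and not the paper's implementation).

```python
# Simplified stand-in: cluster MLST-style sequences with a k-medoids variant of
# k-means, using normalized Hamming distance between concatenated
# housekeeping-gene sequences as the similarity metric. Sequences are toy data.
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b)) / len(a)

def k_medoids(seqs, k, iters=20, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(seqs)), k)
    for _ in range(iters):
        # assign each sequence to its nearest medoid
        clusters = {m: [] for m in medoids}
        for i, s in enumerate(seqs):
            nearest = min(medoids, key=lambda m: hamming(s, seqs[m]))
            clusters[nearest].append(i)
        # move each medoid to the member minimizing total intra-cluster distance
        new_medoids = [min(members, key=lambda c: sum(hamming(seqs[c], seqs[j]) for j in members))
                       for members in clusters.values() if members]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

seqs = ["ACGTACGT", "ACGTACGA", "TTGTACGA", "TTGTCCGA", "ACGAACGT"]
print(k_medoids(seqs, k=2))   # maps medoid index -> member indices
```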